Introduction

This report examines the financial contributions made to presidential campaigns in the state of New York for 2016. The primary dataset was downloaded from datasource. I also created a list of cities in New York with their latitude and longitude, extracted from city_datasource.

Data Structure

The Financial Contribution dataset (after cleaning) contains 167,902 contributions and 23 variables, including:

Variable | Name | Meaning / Use
cmte_id | Committee ID | A 9-character alpha-numeric code assigned to a committee by the Federal Election Commission.
cand_id | Candidate ID | A 9-character alpha-numeric code assigned to a candidate by the Federal Election Commission.
cand_nm | Candidate Name | Recorded name of the candidate.
contbr_nm | Contributor Name | Reported name of the contributor.
contbr_city | Contributor City | Reported city of the contributor.
contbr_state | Contributor State | Reported state of the contributor.
contbr_zip | Contributor Zip Code | Reported zip code of the contributor.
contbr_employer | Contributor Employer | Reported employer of the contributor.
contbr_occupation | Contributor Occupation | Reported occupation of the contributor.
contb_receipt_amt | Contribution Receipt Amount | Reported contribution amount.
contb_receipt_dt | Contribution Receipt Date | Reported contribution date.
receipt_desc | Receipt Description | Additional information reported by the committee about a specific contribution.
memo_cd | Memo Code | ‘X’ indicates the committee has provided additional text to describe a specific contribution.
memo_text | Memo Text | Additional information reported by the committee about a specific contribution.
form_tp | Form Type | Indicates the schedule and line number on which the reporting committee reported a specific transaction.
file_num | File Number | A unique number assigned to a report and all its associated transactions.
tran_id | Transaction ID | A unique identifier for each transaction.
election_tp | Election Type | This code indicates the election for which the contribution was made. EYYYY (election plus election year).

To help with the analysis, I added the following fields to the dataset:

Variable | Meaning / Use
month | Used for grouping data by month.
week | Used for grouping data by week.
year | Used for grouping data by year.
latitude | Stores the latitude based on the reported contribution city.
longitude | Stores the longitude based on the reported contribution city.
employment_status | Stores the employment status of each contributor based on the listed employer.

Note: To attach a latitude and longitude to each contribution, I needed to match the city names in the cities dataframe to the city names in the financial dataframe. Initially, some of the names did not match because of differences in spelling. I used a Python script from a previous project to find the closest matches and correct the spelling of the cities.
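The script used for this report is not reproduced here. As a rough sketch of the idea only, the example below uses Python's standard `difflib` module to map a misspelled city name to the closest name in a reference list. The function name `match_city`, the similarity cutoff of 0.8, and the small reference table with its coordinates are all illustrative assumptions, not the values used in the analysis.

```python
import difflib

def match_city(reported, reference_cities, cutoff=0.8):
    """Return the closest reference city name, or None if nothing is close enough."""
    cleaned = reported.strip().upper()
    candidates = difflib.get_close_matches(cleaned, reference_cities, n=1, cutoff=cutoff)
    return candidates[0] if candidates else None

# Hypothetical reference table: city name -> (latitude, longitude).
reference = {
    "NEW YORK": (40.7128, -74.0060),
    "BROOKLYN": (40.6782, -73.9442),
    "ALBANY": (42.6526, -73.7562),
}

# Reported names as they might appear in the contribution data (made up).
for reported in ["New Yrok", "BROOKLYN ", "Albnay", "Zzzz"]:
    match = match_city(reported, list(reference))
    # reference.get(None) returns None, so unmatched cities get no coordinates.
    print(repr(reported), "->", match, reference.get(match))
```

A stricter cutoff reduces false matches between similarly named cities, at the cost of leaving more names unmatched for manual review.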

Once I had added the additional fields, I was able to start analysing the data. To help with this, I created three grouped summaries of the data:

  • financial.group_by_city - used to plot and analyse the number and value of contributions by city
  • financial.group_by_candidate - used to plot and analyse the contributions made to different candidates
  • financial.group_by_employment - used to plot and analyse contributions made based on the status of employment
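The grouped summaries listed above all follow the same pattern: group the rows by one field, then compute a count, total, average, and maximum of the contribution amount. The sketch below shows that pattern in plain Python. It is not the code used to build the summaries in this report; the helper name `summarise` and the sample rows are made up for illustration, while the field names come from the dataset.

```python
from collections import defaultdict

def summarise(rows, key):
    """Group contribution rows by `key` and summarise count, total, mean and max."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["contb_receipt_amt"])
    return {
        name: {
            "contribution_count": len(amounts),
            "contribution_value": sum(amounts),
            "avg_contribution": sum(amounts) / len(amounts),
            "max_contribution": max(amounts),
        }
        for name, amounts in groups.items()
    }

# Made-up sample rows using the dataset's field names.
rows = [
    {"contbr_city": "ALBANY", "cand_id": "P00003392", "contb_receipt_amt": 50.0},
    {"contbr_city": "ALBANY", "cand_id": "P60007168", "contb_receipt_amt": 25.0},
    {"contbr_city": "BUFFALO", "cand_id": "P00003392", "contb_receipt_amt": 100.0},
]

by_city = summarise(rows, "contbr_city")
by_candidate = summarise(rows, "cand_id")
print(by_city["ALBANY"])
print(by_candidate["P00003392"])
```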

Candidates

The table below lists the ID and name of each candidate:

cand_id | cand_nm
P60008059 | Bush, Jeb
P60005915 | Carson, Benjamin S.
P60008521 | Christie, Christopher J.
P00003392 | Clinton, Hillary Rodham
P60006111 | Cruz, Rafael Edward ‘Ted’
P60007242 | Fiorina, Carly
P60007697 | Graham, Lindsey O.
P80003478 | Huckabee, Mike
P60008398 | Jindal, Bobby
P60003670 | Kasich, John R.
P60009685 | Lessig, Lawrence
P60007671 | O’Malley, Martin Joseph
P60007572 | Pataki, George E.
P40003576 | Paul, Rand
P20003281 | Perry, James R. (Rick)
P60006723 | Rubio, Marco
P60007168 | Sanders, Bernard
P20002721 | Santorum, Richard J.
P20003984 | Stein, Jill
P80001571 | Trump, Donald J.
P60006046 | Walker, Scott
P60008885 | Webb, James Henry Jr.


Univariate Analysis

What are the main feature(s) of interest in your dataset?

The main features of this dataset are the candidate and the value of the contributions that each candidate received. The data below shows a breakdown of the contributions.

## Total Value of Contributions:  46072566 
## Total Number of Contributions:   167902 
## Average Value of Contribution: 274.4015 
## Maximum Contribution Value:       10800 
## Minimum Contribution Value:        0.08 
## Number of Candidates:                22 
## Number of Contributors:           35955

The plot below shows that two candidates, P00003392 (Hillary Clinton) and P60006723 (Marco Rubio), received the highest number of contributions.

The first plot in the group above shows that the data is right skewed, with the majority of the contributions being less than or equal to $500. To see the spread of the data better, I performed a log transform on the transaction amount, which can be seen in the second plot of the group.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Other features in the dataset that would be useful to investigate are:

Contributor Occupation / Employer

When I first grouped the data by the employer and occupation of the contributor, I could see that there were:

## Employers:  14735 
## Occupations: 6732

This made it difficult to determine whether there were any distinct patterns, as some of these could be similar occupations with different titles, or the same employer recorded under different names. To look for patterns, I created an additional variable in the dataset for employment status, based on the listed employer.

employment_status | contribution_count | contribution_value | avg_contribution | max_contribution
EMPLOYED | 92693 | 29817871 | 321.68418 | 10800
SELF EMPLOYED | 25169 | 6525750 | 259.27727 | 10800
NOT EMPLOYED | 20650 | 1731867 | 83.86767 | 5400
RETIRED | 15542 | 2525894 | 162.52052 | 5400
UNKNOWN | 13848 | 5471183 | 395.08835 | 10800
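The exact rules used to derive employment_status from the listed employer are not given in this report. The sketch below shows one way such a mapping could work, collapsing free-text employer values into the five statuses in the table above. The function name and every keyword it checks for (for example "INFORMATION REQUESTED" or "HOMEMAKER") are assumptions made for illustration.

```python
def employment_status(employer):
    """Collapse a free-text employer value into one of five broad statuses."""
    text = (employer or "").strip().upper()
    # Assumed markers for a missing employer; the real data may use others.
    if text in {"", "N/A", "INFORMATION REQUESTED"}:
        return "UNKNOWN"
    if text.startswith("SELF"):
        return "SELF EMPLOYED"
    if text == "RETIRED":
        return "RETIRED"
    if text in {"NOT EMPLOYED", "UNEMPLOYED", "STUDENT", "HOMEMAKER"}:
        return "NOT EMPLOYED"
    # Anything else is treated as the name of an actual employer.
    return "EMPLOYED"

# Made-up employer values, including a missing one.
for value in ["Self-Employed", "retired", None, "IBM", " not employed "]:
    print(repr(value), "->", employment_status(value))
```

Because any unrecognised text falls through to EMPLOYED, misspelled status words would be counted as employers, so the keyword sets would need checking against the actual values in contbr_employer.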

From this plot we can see that the bulk of the contributions came from contributors who were employed at the time of the contribution.

Location of contributors

The plot below shows that the locations of contributions were generally spread out across the state of New York, with a couple of districts (Capital District and Central New York) that had a larger number of contributions. This probably reflects the population spread across the state and where businesses are generally located.

Contributions over time (Receipt Date of the Contribution)

## [1] "Summary"
##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "2013-10-11" "2015-10-19" "2016-01-13" "2015-12-08" "2016-02-11" 
##         Max. 
## "2016-02-29"

From this plot we can see that the number of contributions has increased over time. However, the data and summary show an outlier in 2013. I believe this could be due to data entry errors or to data being recorded later than the transaction occurred.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

During the initial review of the data, I found a few outliers that distorted the spread of the contribution values and dates. These outliers included:

  • A contribution of 3,686,000
  • A contribution of -5,400
  • Contributions recorded with dates outside of the last 12 months

To help see the spread of contributions, I also applied a SQRT or LOG10 transform to the scales when the data was too close together to analyse and interpret. This made it possible to see the data in more detail. The risk of transforming the scales is that the plots can be misinterpreted, so the axis scales need to be read carefully.
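As a small illustration of why these transforms help, the sketch below applies log10 and square root to a handful of amounts. Only the minimum (0.08), the maximum (10,800) and the negative outlier (-5,400) come from this report; the other values are made up. Note that log10 is only defined for positive amounts, so negative values such as refunds have to be removed first.

```python
import math

# Mostly made-up amounts; 0.08, 10800 and -5400 appear in this report.
raw = [0.08, 5, 25, 100, 500, 2700, 10800, -5400]

# log10 is undefined for zero and negative values, so keep positives only.
amounts = [a for a in raw if a > 0]

log_scaled = [math.log10(a) for a in amounts]
sqrt_scaled = [math.sqrt(a) for a in amounts]

for amount, log_value, sqrt_value in zip(amounts, log_scaled, sqrt_scaled):
    print(f"{amount:>9.2f}  log10={log_value:6.2f}  sqrt={sqrt_value:7.2f}")
```

On the raw scale the gap between 5 and 25 is tiny compared with the gap between 2,700 and 10,800. On the log10 scale the two gaps are of similar size (about 0.70 and 0.60), which is why small contributions become visible but also why the axis labels must be read carefully.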


Bivariate Analysis

Relationships between the main features

The first relationship I analysed was between each candidate and the total value of the contributions they received, which is shown in the next plot. While candidate P60006723 (Marco Rubio) had the highest number of contributions, the candidate with the highest value of contributions was P00003392 (Hillary Clinton).

In order to see the spread of the value of the contributions I applied a SQRT coordinate transformation on the y-axis, which can be seen on the plot below.

The plot below shows the spread of contribution values and the number of times each value was contributed. From this plot we can start to see that as the contribution value increases, the number of times that contribution is made decreases.

Relationships between other features

Contribution Values by Employment Status

The plots below show that the bulk of the total contributions were made by employed contributors; however, the highest average contribution came from the unknown employment status. This status is made up of contributors who did not have an employer recorded against their contribution. The employed status also had the highest number of contributions, which is consistent with this group having the highest contribution total but not the highest average. The unknown group also has the lowest number of contributions, which pushes up its average contribution value.

Contribution Values by location

When looking at the contribution values by location, we can see patterns similar to those in the count of contributions by location. The areas with the higher contribution values correspond to the locations with the higher number of contributions.

Contribution Values over time

What was the strongest relationship you found?

The strongest relationship that I observed was between time and the number of contributions received. As the campaigning process progresses, the number of contributions increases. What I had expected, but didn’t see, was an increase in the total amount contributed in each time period.


Multivariate Analysis

Relationships

During this part of the analysis, I looked at how the progression of the campaign affected the amount and number of contributions. I first wanted to see whether some buckets (bins) of contribution amounts increased or decreased more than others. To do this, I split the contribution amounts into quartiles. The plot below shows that while the value of contributions per month is increasing for every quartile, the greatest increase occurs in the lowest quartile ($0.08 to $25).
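The quartile bucketing described above can be sketched as follows. This is an illustration in plain Python rather than the code used for the plots: the sample contributions are made up, the helper `quartile_of` is hypothetical, and the "inclusive" quantile method is an assumption, so the cut points would differ from those found in the real data.

```python
import statistics
from bisect import bisect_left
from collections import defaultdict

def quartile_of(amount, cuts):
    """Return 1-4: the quartile bucket an amount falls in, given three cut points."""
    return bisect_left(cuts, amount) + 1

# Made-up (month, amount) pairs standing in for the contribution data.
contributions = [
    ("2015-12", 5.0), ("2015-12", 100.0), ("2015-12", 2700.0),
    ("2016-01", 10.0), ("2016-01", 15.0), ("2016-01", 50.0),
    ("2016-02", 25.0), ("2016-02", 250.0),
]

# Three cut points that split the amounts into four equally sized groups.
amounts = [amount for _, amount in contributions]
cuts = statistics.quantiles(amounts, n=4, method="inclusive")

# Total contribution value per (month, quartile) pair.
totals = defaultdict(float)
for month, amount in contributions:
    totals[(month, quartile_of(amount, cuts))] += amount

for (month, quartile), value in sorted(totals.items()):
    print(month, f"Q{quartile}", value)
```

Plotting these monthly totals as one line per quartile makes it possible to compare how quickly each bucket grows.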

Next, I looked at the top 5 candidates to see how their total contribution amounts varied over time. The plots below show that most of the top 5 candidates have ups and downs, with all candidates dropping around the holiday season. The candidate with the strongest and most consistent growth trend was P60007168 (Sanders, Bernard). The only candidate with the reverse trend was P60008059 (Bush, Jeb). The heatmaps further below show a similar story for all candidates.

I also wanted to see whether some candidates received more contributions from particular areas than others. However, the plots below show that the top 5 candidates were receiving contributions from similar areas. This may reflect the population spread in New York.


Reflection / Conclusion

After exploring the data, I was able to conclude that as the campaign progresses, the number of contributions increases for most of the candidates. However, this does not have a direct impact on the total value of the contributions for each month, because the greatest growth in the number of contributions occurs in the lowest quartile ($0.08 to $25.00).

When looking at where contributions are made from, I believe the analysis would be most useful for the USA overall, as we could then draw a link between the number of contributions and the popularity of each candidate by state.

Another aspect that might have been useful to analyse is the gender and age of the contributors, as this could have helped to show whether any groups of people were more likely to contribute to some candidates than others.

The hardest part of this investigation was deciding which data to compare. I found that there was greater benefit in grouping the contributions into categories, for example by employment status.